Skip to content

A few micro-optimizations#6

Merged
calebzulawski merged 1 commit intocalebzulawski:masterfrom
TethysSvensson:master
Sep 11, 2019
Merged

A few micro-optimizations#6
calebzulawski merged 1 commit intocalebzulawski:masterfrom
TethysSvensson:master

Conversation

@TethysSvensson
Copy link
Copy Markdown
Contributor

This commit optimizes the generated code in two ways:

  • Less indirections: Before this commit there were multiple levels of wrappers calling each other. Since the wrappers had different different target_features enabled, they could not simply be inlined inside each other.
  • Branch-less dispatching: The dispatcher will no longer check for null and branch. Instead the loaded pointer is called directly. Initially the pointer is not set to null, but to a default handler that will update the atomic pointer before calling the resolved function.

Additionally I am not sure whether SeqCst semantics will actually result in any improved performance -- I think Relaxed semantics should be enough?

@TethysSvensson TethysSvensson changed the title Emit smaller dispatcher code A few micro-optimizations Sep 11, 2019
@calebzulawski
Copy link
Copy Markdown
Owner

Thanks for this, both __resolver_fn and marking the clones unsafe is a big improvement.

I think you're right about Relaxed ordering, especially since the pointer can only be one of two values and either are correct at any point in time.

@calebzulawski calebzulawski merged commit 44a8479 into calebzulawski:master Sep 11, 2019
@TethysSvensson
Copy link
Copy Markdown
Contributor Author

After sleeping on it I realized the reasoning behind one of those wrappers. After my commit, it is not possible to write this code:

#[target_clones("[x86|x86_64]+avx")]
pub fn bad_unsafe() {
    let x: Vec<u32> = std::mem::uninitialized();
}

I'll make a new PR that re-introduces one of the wrappers without causing worse generated output.

@gnzlbg
Copy link
Copy Markdown

gnzlbg commented Sep 17, 2019

@TethysSvensson I think the is_x86_feature_detected! macros of the standard library could benefit from a similar optimization to remove a branch. If you enjoy this a good place to start would be the rust-lang/stdarch repository, and in there, the std_detect sub crate.

It uses an AtomicUXY as a bitset, and there is a function that checks whether the bitset has already been initialized every time that one attempts to detect a feature, and if it isn't, it then initializes it. It should probably be possible to replace that function with a function pointer that, on the first call, initializes the cache, and then modifies the function pointer to one that does not do initialization, such that this is then always called all the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants